Abstract

With the rapid growth of mobile apps, more and more threats are migrating from the traditional PC client to mobile devices.
Just as the Windows-Intel alliance dominated the PC, the Android alliance dominates the mobile Internet, and apps have
replaced PC client software as the foremost target of malicious use. In this paper, to improve the security status of current
mobile apps, we propose a methodology for evaluating mobile apps based on a cloud computing platform and data mining.
In contrast to traditional methods, such as permission-pattern-based methods, it combines dynamic and static analysis to
comprehensively evaluate an Android application. The Internet of Things (IoT) denotes a worldwide network of
interconnected, uniquely addressable items that communicate via standard protocols. Accordingly, to prepare for the
forthcoming invasion of things, data fusion can be used to manipulate and manage such data in order to improve
processing efficiency and provide advanced intelligence. In this paper, we propose an efficient multidimensional fusion
algorithm for IoT data based on partitioning. Attribute reduction and rule extraction methods are then used to obtain the
synthesis results. The correctness and effectiveness of this algorithm are illustrated by proving a few theorems and by
simulation. This paper also introduces and investigates large iterative multitier ensemble (LIME) classifiers specifically
tailored for big data. These classifiers are very large, but are quite easy to generate and use. They can be so large that it
makes sense to use them only for big data. Our experiments compare LIME classifiers with various base classifiers and
standard ensemble meta classifiers. The results demonstrate that LIME classifiers can significantly increase classification
accuracy and that they perform better than the base classifiers and standard ensemble meta classifiers.
1. Introduction

The information overload problem stems from the fact that the increasing amount of data makes it harder and more
time-consuming for users to find their preferred items. This situation has promoted the development of recommender
systems [1, 2], one of the most promising information filtering technologies, which match users with the most appropriate
items by learning about their preferences. Because they are algorithmically simple and yield recommendations that are
easier to interpret than those of model-based methods, similarity-based methods have been widely applied. They predict a
user's interest in an item from a weighted combination of the ratings given by similar users to the same item or by the
same user to similar items. Similar users are other users who tend to give similar ratings to the same items, while similar
items are items that tend to receive similar ratings from the same users. Therefore, recommendation quality depends
mainly on the accuracy of the similarity measurement for users and items.

The general definition of data fusion [3,4] is that it is a formal framework that contains expressed means and tools for 
the alliance of data originating from different sources. It aims at obtaining information of greater quality: the exact 
definition of greater quality depends on the application. In the IoT environment, data fusion is also a framework that 
comprises theories, methods, and algorithms for interoperating and integrating multisource heterogeneous data from 
sensor measurements or other sources, combining and mining the measurement data from multiple sensors and related 
information obtained from associated databases, and achieving improved accuracy and more specific inferences than 
that obtained by using only a single sensor. 

The origins, provenance, and spread of malware merit some discussion.

1) The Android platform allows users to install apps from third-party marketplaces that may make no effort to verify
the safety of the software they distribute.

2) Different marketplaces have different defense utilities and revocation policies for malware detection.

3) It is easy to port an existing Windows-based botnet client to the Android platform.

4) Android application developers can upload their applications without any check of trustworthiness. The
applications are self-signed by the developers themselves, without the intervention of any certification authority.

5) A number of applications have been modified, and malware has been packed into them and spread through
unofficial repositories.

Graphs are among the most commonly used abstract data structures in computer science, and they enable a more
complex and comprehensive representation of data than linked lists and tree structures. Many problems in real
applications need to be described using a graph structure, and the processing of graph data is required in almost all of
these cases, such as the optimization of railway paths, the prediction of disease outbreaks, the analysis of technical
literature citation networks, and emerging applications such as social network analysis, semantic network analysis, and
the analysis of biological information networks.

We propose an efficient fusion algorithm for multidimensional IoT data based on partitioning. The basic idea of this
algorithm is that a large data set with many dimensions can be transformed into relatively smaller data sets that can be
easily processed. First, we partition the high-dimensional data set into blocks of lower-dimensional data sets. Then, we
compute the core attribute set of each block of data. Thereafter, we use the core attribute sets of all data subsets to
determine a global core attribute set. Finally, based on this global core attribute set, we compute the reduction and mine
the correlations among the multidimensional measurement data and certain states of interest with regard to the
facilities or humans.
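The partitioning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm as specified by its theorems: the core of a block is taken in the usual rough-set sense (an attribute is in the core if removing it shrinks the positive region), the per-block cores are merged by a plain set union (an assumption, since the text does not say how the global core is determined), and the attribute names and sensor readings are invented for the example.

```python
from collections import defaultdict


def positive_region_size(rows, attrs, decision):
    """Count rows whose equivalence class under `attrs` maps to one decision value."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row[decision])
    return sum(len(d) for d in classes.values() if len(set(d)) == 1)


def core_attributes(rows, attrs, decision):
    """Attributes whose removal from `attrs` shrinks the positive region."""
    full = positive_region_size(rows, attrs, decision)
    return {
        a for a in attrs
        if positive_region_size(rows, [b for b in attrs if b != a], decision) < full
    }


def partition(attrs, block_size):
    """Split the attribute list into consecutive blocks of at most `block_size`."""
    return [attrs[i:i + block_size] for i in range(0, len(attrs), block_size)]


def candidate_global_core(rows, attrs, decision, block_size):
    """ASSUMPTION: merge the per-block cores by set union (not taken from the paper)."""
    merged = set()
    for block in partition(attrs, block_size):
        merged |= core_attributes(rows, block, decision)
    return merged


# Hypothetical sensor readings: the state is "alarm" only when temp is high and motion is yes.
ATTRS = ["temp", "hum", "light", "motion"]
RAW = [
    ("high", "low", "on", "yes", "alarm"),
    ("high", "high", "off", "yes", "alarm"),
    ("low", "low", "on", "yes", "normal"),
    ("low", "high", "off", "yes", "normal"),
    ("high", "low", "on", "no", "normal"),
    ("high", "high", "off", "no", "normal"),
]
ROWS = [dict(zip(ATTRS + ["state"], r)) for r in RAW]

GLOBAL_CORE = candidate_global_core(ROWS, ATTRS, "state", block_size=2)
```

On this toy table the merged result is `{"temp", "motion"}`, which happens to coincide with the core computed on the unpartitioned table; the sketch makes no claim that a plain union does so in general.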

2. Related Work 

The memory-based approach to collaborative filtering (CF) uses user rating data to compute the similarity between users
or items, which is then used for making recommendations. This was an early approach used in many commercial
systems. It is effective and easy to implement. Typical examples of this approach are neighborhood-based CF and
item-based/user-based top-N recommendations. For example, in user-based approaches, the rating r_{u,i} that user u
gives to item i is calculated as an aggregation of some similar users' ratings of the item:

$$ r_{u,i} = \operatorname{aggr}_{u' \in U} r_{u',i} $$


Figure 1. Item-based collaborative filtering. The similarity sim(i, j) between items i and j is estimated as the Pearson
correlation of rankings from users who have rated both items.

where U denotes the set of the top N users that are most similar to user u and that have rated item i. Some examples of
the aggregation function include:

$$ r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i} $$

$$ r_{u,i} = k \sum_{u' \in U} \operatorname{sim}(u,u')\, r_{u',i} $$

$$ r_{u,i} = \bar{r}_u + k \sum_{u' \in U} \operatorname{sim}(u,u')\,(r_{u',i} - \bar{r}_{u'}) $$

where k is a normalizing factor defined as $k = 1 / \sum_{u' \in U} |\operatorname{sim}(u,u')|$ and $\bar{r}_u$ is the
average rating of user u for all the items rated by u. The neighborhood-based algorithm calculates the similarity between
two users or items and produces a prediction for the user by taking the weighted average of all the ratings. Similarity
computation between items or users is an important part of this approach. Multiple measures, such as Pearson
correlation and vector cosine-based similarity, are used for this.
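As an illustration of the weighted-average prediction, the sketch below implements one common form of the aggregation: the active user's mean rating plus a normalized, similarity-weighted sum of the neighbors' mean-centered ratings. The normalizing factor is assumed to be k = 1 / Σ|sim(u, u')|, the similarities are taken as given, and the function names and rating data are hypothetical.

```python
def mean_rating(ratings, user):
    """Average rating of `user` over all items that user has rated."""
    values = ratings[user].values()
    return sum(values) / len(values)


def predict_rating(ratings, similarities, user, item):
    """Predict the rating of `user` for `item` from neighbours who rated it.

    `similarities` maps each neighbour to sim(user, neighbour).
    ASSUMPTION: k = 1 / sum(|sim|) over the neighbours that rated `item`.
    """
    neighbours = [v for v in similarities if item in ratings.get(v, {})]
    norm = sum(abs(similarities[v]) for v in neighbours)
    baseline = mean_rating(ratings, user)
    if norm == 0:
        # No usable neighbour: fall back to the user's own average.
        return baseline
    k = 1.0 / norm
    offset = sum(
        similarities[v] * (ratings[v][item] - mean_rating(ratings, v))
        for v in neighbours
    )
    return baseline + k * offset


# Hypothetical ratings: user "u" has not rated item "c".
RATINGS = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 3, "b": 1, "c": 5},
    "w": {"a": 2, "b": 3, "c": 1},
}
SIMS = {"v": 0.8, "w": 0.2}

PREDICTION = predict_rating(RATINGS, SIMS, "u", "c")
```

Here the neighbor "v" rated item "c" two points above its own mean and "w" one point below, so the prediction for "u" lands at about 4.4, above the user's mean of 3.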
The Pearson correlation similarity of two users x, y is defined as

$$ \operatorname{sim}(x,y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2}\,\sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}} $$


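where $I_{xy}$ is the set of items rated by both user x and user y. The cosine-based approach defines the cosine
similarity between two users x and y as [1]:

$$ \operatorname{sim}(x,y) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} $$

where $\vec{x}$ and $\vec{y}$ are the rating vectors of users x and y.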
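Both similarity measures can be sketched directly from their definitions. In this minimal example each user is a dictionary from item to rating; the mean used in the Pearson measure is the user's average over all rated items, and the cosine measure treats unrated items as zeros in the rating vector. The function names and sample ratings are illustrative assumptions.

```python
from math import sqrt


def pearson_similarity(rx, ry):
    """Pearson correlation of two users over the items both have rated."""
    common = set(rx) & set(ry)
    if not common:
        return 0.0
    mean_x = sum(rx.values()) / len(rx)  # average over all items rated by x
    mean_y = sum(ry.values()) / len(ry)  # average over all items rated by y
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in common)
    den = sqrt(sum((rx[i] - mean_x) ** 2 for i in common)) * sqrt(
        sum((ry[i] - mean_y) ** 2 for i in common)
    )
    return num / den if den else 0.0


def cosine_similarity(rx, ry):
    """Cosine of the angle between two rating vectors (unrated items count as 0)."""
    common = set(rx) & set(ry)
    num = sum(rx[i] * ry[i] for i in common)
    den = sqrt(sum(v * v for v in rx.values())) * sqrt(
        sum(v * v for v in ry.values())
    )
    return num / den if den else 0.0


# Hypothetical users: y rates everything twice as high as x, z in reverse order.
X = {"a": 1, "b": 2, "c": 3}
Y = {"a": 2, "b": 4, "c": 6}
Z = {"a": 3, "b": 2, "c": 1}
```

With these ratings, `pearson_similarity(X, Y)` and `cosine_similarity(X, Y)` both come out at 1 up to rounding, while `pearson_similarity(X, Z)` comes out at -1, since Z ranks the items in the opposite order.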
The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k users most
similar to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated
to identify the set of items to be recommended. A popular method for finding similar users is locality-sensitive hashing,
which implements the nearest-neighbor mechanism in linear time. The advantages of this approach include the
explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy
incorporation of new data; content-independence of the items being recommended; and good scaling with co-rated items.
There are also several disadvantages. Its performance decreases when data become sparse, which occurs frequently
with web-related items. This hinders the scalability of the approach and creates problems with large datasets. Although
it can efficiently handle new users because it relies on a data structure, adding new items is more complicated, since
that representation usually relies on a specific vector space. Adding a new item requires including the new item and
re-inserting all the elements in the structure.
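The top-N procedure described above can be illustrated with a small brute-force sketch. It is a simplification under stated assumptions: neighbors are found by an exhaustive cosine-similarity scan rather than locality-sensitive hashing, the neighbors' ratings are aggregated as a similarity-weighted sum, and the user names and ratings are invented for the example.

```python
from math import sqrt


def cosine(rx, ry):
    """Cosine similarity of two rating dictionaries (unrated items count as 0)."""
    num = sum(rx[i] * ry[i] for i in set(rx) & set(ry))
    den = sqrt(sum(v * v for v in rx.values())) * sqrt(
        sum(v * v for v in ry.values())
    )
    return num / den if den else 0.0


def top_n_recommendations(ratings, active, k=2, n=2):
    """Recommend up to `n` unseen items using the `k` most similar users."""
    # Exhaustive neighbour search; LSH would replace this step at scale.
    neighbours = sorted(
        ((cosine(ratings[active], r), u) for u, r in ratings.items() if u != active),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbours:
        for item, rating in ratings[user].items():
            if item not in ratings[active]:
                # ASSUMPTION: aggregate by similarity-weighted sum of ratings.
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


# Hypothetical rating data.
RATINGS = {
    "alice": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 4, "c": 5, "d": 1},
    "carol": {"a": 4, "b": 5, "c": 4, "e": 2},
    "dave": {"d": 5, "e": 5},
}

RECOMMENDED = top_n_recommendations(RATINGS, "alice", k=2, n=2)
```

For "alice", the two nearest neighbors are "carol" and "bob"; item "c" is rated highly by both, so it ranks first, followed by "e".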
Figure 2: Multidimensional IoT data (panel label: Applications Services Ecosystem and Delivery Framework)

Recently, one of the most popular research topics in data fusion for IoT has been the interoperability and integration
[5, 6] of multisource heterogeneous data, including IoT data abstraction [10, 11] and access, linked sensor data [12],
resource/service search and discovery [13], and semantic reasoning and interpretation [14]. These studies are largely
based on semantic Web technologies. Another popular research topic is big data management and mining [15-17] for
gleaning useful information from the massive amount of data generated by such networks. These studies are mainly
based on data fusion theory and algorithms and on distributed information system technology [18]. The efficient
partition-based fusion algorithm for multidimensional IoT data proposed in this paper is a fusion method for big data. It
focuses on improving the computational efficiency for data with higher dimensions. The fusion results will be discussed
in future work.

Related work on Android malware relies on program analysis, such as data-flow analysis and visualization of control
flow graphs. One study analyzed about 136 000 benign apps and 6100 malicious apps; its results confirm previous
observations for smaller app sets and, moreover, provide some new insights into typical Android apps. Another work
proposed Airmid, which uses collaboration between in-network sensors and smart devices to identify the provenance of
malicious traffic. Its authors created three mobile malware samples, i.e., Loudmouth, 2Faced, and Thor, to verify the
correctness of Airmid. Airmid's remote repair design consists of an on-device attribution and remediation system and a
server-based infection detection system. Once an infection is detected, the software executes repair actions to disable
malicious activity or to remove the malware entirely.




Figure 3: System Architecture Overview 


3. Infrastructure Cloud Platform

Apache CloudStack is open-source software designed to deploy and manage large networks of virtual machines as a
highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a
number of service providers to offer public cloud services, and by many companies to provide an on-premises (private)
cloud offering or as part of a hybrid cloud solution.

CloudStack is a turnkey solution that includes the entire "stack" of features most organizations want with an IaaS
cloud: compute orchestration, Network-as-a-Service, user and account management, a full and open native API,
resource accounting, and a first-class user interface (UI).

CloudStack currently supports the most popular hypervisors: VMware, KVM, Citrix XenServer, Xen Cloud Platform
(XCP), Oracle VM Server, and Microsoft Hyper-V.

Users can manage their cloud with an easy-to-use Web interface, command-line tools, and/or a full-featured RESTful
API. In addition, CloudStack provides an API that is compatible with AWS EC2 and S3 for organizations that wish to
deploy hybrid clouds.



Figure 4: Infrastructure cloud platform based on CloudStack


As we have seen (Sections X-A and X-B), a probabilistic machine can help to identify probable errors in big data. But,
contradictory as it may seem, a consequence of working with probabilities, for both people and machines, is that
mistakes may be made. We may bet on "Desert King" only to find that "Midnight Lady" is the winner. And in the same
way that people can be misled by a frequently repeated lie, probabilistic machines are likely to be vulnerable to
systematic distortions in data. These observations may suggest that we should stick with computers in their traditional
form, delivering precise answers.

There are reasons to believe that computing and mathematics are fundamentally probabilistic: "I have recently been
able to take a further step along the path laid out by Gödel and Turing. By translating a particular computer program
into an algebraic equation of a type that was familiar even to the ancient Greeks, I have shown that there is randomness
in the branch of pure mathematics known as number theory. My work indicates that, to borrow Einstein's metaphor,
God sometimes plays dice with whole numbers."


VISUALISATION 

"Methods for visualization and exploration of complex and vast data constitute a crucial component of an analytics
infrastructure." An area that requires attention is "the integration of visualization with statistical methods and other
analytic techniques in order to support discovery and analysis."


In the analysis of big data, it is likely to be helpful if the results of analysis, and analytic processes, can be displayed 
with static or moving images. 



Figure 5: SP system 
The SP system has three main strengths:

• Transparency in the representation of knowledge. In contrast with sub-symbolic approaches to artificial
intelligence, there is transparency in the representation of knowledge with SP patterns and their assembly into
multiple alignments. Both SP patterns and multiple alignments may be displayed as they are or, where appropriate,
translated into other graphical forms such as tree structures, networks, tables, plans, or chains of inference.

• Transparency in processing. In building multiple alignments and deriving grammars and encodings, the SP system
creates audit trails. These allow the processes to be inspected and could, with advantage, be displayed with moving
images to show how knowledge structures are created.

• The DONSVIC principle. As previously noted, the SP system aims to realize the DONSVIC principle and is proving
successful in that regard. This means that structures created or discovered by the system (entities, classes of entity,
and so on) should be ones that people regard as natural. Those kinds of structures are also likely to be ones that are
well suited to representation with static or moving images.

4. Evaluation

Operations for analysis 

The data set was collected during the three-month period from May 1st to July 31st, 2012. It comprises about 1 TB of
zipped logs (more than 10 TB when expanded). In total, there are about 100 000 active Android apps in the logs. We
downloaded Android apps from App China and verified them with MobSafe. Each downloaded Android app has its own
web page on the market website. We also crawled the web version of the Android market to supply each Android app
with a text description, and we checked correctness using self-written malware. Figure 3 shows that the total number of
active apps in App China increased steadily during these three months, maintaining a growth rate above 10%. We
classify the Android devices into three categories according to display resolution: Low class, Middle class, and High
class. Devices with these resolutions account for about 90% of all Android devices. We also notice that the number of
users of high-resolution Android devices increased steadily, while the number of users of some middle-resolution
Android devices decreased steadily. It seems that the display resolution of Android devices increased steadily over these
three months. It should also be noted that the number of apps installed on a mobile Android device is about 30
according to the three months of statistics.

Our experiments are devoted to evaluating the performance of LIME classifiers for the detection of malware using big 
data. It is critically important to conduct experiments and assess various classification schemes for processing of Big 
Data in particular areas. The outcomes of such experiments can be used to improve the performance of future practical 
implementations and can contribute to assessing further steps for future research. The performance of a classifier 
cannot be predicted on a purely theoretical basis. For any classification scheme that is able to produce very good 
outcomes in a specialized domain, there always exist other areas where different methods may turn out more effective. 
There are even theoretical results, known as "no-free-lunch" theorems, which imply that there does not exist a single
algorithm that performs best for all problems. We used 10-fold cross validation to evaluate the effectiveness of 
classifiers in all experiments. The following measures of performance of classifiers are often used in this research 
direction: precision, recall, F-measure, accuracy, sensitivity, specificity and Area under Curve also known as the 
Receiver Operating Characteristic or ROC area. Notice that weighted average values of the performance metrics are 
usually used. This means that they are calculated for each class separately, and a weighted average is found then. In 
contrast, the accuracy is defined for the whole classifier as the percentage of all instances classified correctly, which 
means that this definition does not involve weighted averages in the calculation. Precision of a classifier, for a given 
class, is the ratio of true positives to combined true and false positives. Sensitivity is the proportion of positives 
(malware) that are identified correctly. Specificity is the proportion of negatives (legitimate software) which are 
identified correctly. Sensitivity and specificity are measures evaluating binary classifications. For multi-class 
classifications they can be also used with respect to one class and its complement. Sensitivity is also called True 
Positive Rate. False Positive Rate is equal to 1 - specificity. These measures are related to recall and precision. Recall is 
the ratio of true positives to the number of all positive samples (i.e., to the combined true positives and false negatives). 
The recall calculated for the class of malware is equal to sensitivity of the whole classifier. 
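The definitions above can be made concrete with a short sketch that derives each measure from the confusion counts of a binary malware/benign classification. Two points are assumptions rather than statements from the text: the F-measure is taken to be the harmonic mean of precision and recall, and the weighted average weights each class by its share of the true labels. The label names and predictions are invented for the example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive="malware"):
    """Precision, recall (sensitivity), specificity, FPR, accuracy, F-measure."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to sensitivity / true positive rate
    specificity = ratio(tn, tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "accuracy": ratio(tp + tn, len(y_true)),
        # ASSUMPTION: F-measure = harmonic mean of precision and recall.
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's support."""
    total = len(y_true)
    result = 0.0
    for label in set(y_true):
        tp, fp, _, _ = confusion_counts(y_true, y_pred, label)
        result += (y_true.count(label) / total) * ratio(tp, tp + fp)
    return result


# Hypothetical labels: 4 malware samples and 6 benign ones.
Y_TRUE = ["malware"] * 4 + ["benign"] * 6
Y_PRED = ["malware", "malware", "malware", "benign"] + ["benign"] * 5 + ["malware"]

METRICS = binary_metrics(Y_TRUE, Y_PRED)
```

In this example three of the four malware samples are detected and one benign app is flagged, giving a precision and recall of 0.75, a specificity of 5/6, and an accuracy of 0.8. The weighted precision is also about 0.8, because the benign class (precision 5/6) carries 60% of the weight.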

In keeping with the long tradition in engineering of borrowing ideas from biology, the structure and functioning of
brains provide reasons for trying to develop a universal framework for the representation and processing of knowledge
(UFK):

• Since brains are composed largely of neural tissue, it appears that neurons and their inter-connections, with glial 
cells, provide a universal framework for the representation and processing of all kinds of sensory data and all other 
kinds of knowledge. 

• In support of that view is evidence that one part of the brain can take over the functions of another part. This
implies that there are some general principles operating across several parts of the brain, perhaps all of them.

• Most concepts are an amalgam of several different kinds of data or knowledge. For example, the concept of a
"picnic" combines the sights, sounds, tactile and gustatory sensations, and the social and logistical knowledge
associated with such things as a light meal in pleasant rural surroundings. To achieve that kind of seamless
integration of different kinds of knowledge, it seems necessary for the human brain to be or to contain a UFK.

Figure 6: Comparison system

5. Conclusion

The computation of attribute reduction has been proven to be a non-deterministic polynomial-time hard (NP-hard)
problem. Therefore, the IoT poses a formidable challenge for the computation and fusion of the high-dimensional big
data generated by the participating networks. Several theorems have been presented to illustrate the correctness of the
proposed algorithm. Further, we performed a simulation to demonstrate the efficiency and effectiveness of the proposed
algorithm. In a future study, the fusion results of the measurement data will be presented, and the relationships between
the number of dimensions, the number of partitions, and the volume of objects, as well as their influence on
computational efficiency, will be discussed. As the mobile app market serves as the main line of defense against mobile
malware, it is practical to use a cloud computing platform to defend against malware in mobile app markets. We
introduced and investigated four-tier LIME classifiers, which originate as a contribution to the general approach
considered by many authors. We obtained new results evaluating the performance of such large four-tier LIME
classifiers. These results show, in particular, that Random Forest performed best in this setting and that novel four-tier
LIME classifiers can be used to achieve further improvement of the classification outcomes. We carried out a systematic
investigation of new automatically generated four-tier LIME classifiers, in which diverse ensemble meta classifiers are
combined into a unified system by integrating different ensembles at the third and second tiers as parts of their parent
ensemble meta classifiers at the higher tier. They are effective if diverse ensemble meta classifiers are combined at
different tiers of the LIME classifier, and they significantly improved on the performance of base classifiers and
standard ensemble meta classifiers.